Data Visualization Draft


Group F
18/06/2024

Part A

In [24]:
# Installing necessary packages
!pip install seaborn kaleido plotly matplotlib scikit-learn

# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import zipfile
import matplotlib.pyplot as plt
import kaleido
import plotly
import plotly.express as px
import json
from IPython.display import Image
from urllib.request import urlopen
from sklearn import datasets, linear_model

# Displaying versions to ensure correct installation
print("Kaleido version:", kaleido.__version__)
print("Plotly version:", plotly.__version__)
Kaleido version: 0.2.1
Plotly version: 5.22.0

My two chosen datasets:

https://www.kaggle.com/datasets/mikejohnsonjr/united-states-crime-rates-by-county/data
This dataset contains crime figures per county in the United States. It has columns such as theft, rape, murder, population, and county names. The dataset provides detailed information on different crime types and population size, which is useful for criminological research and policymaking.

number of instances: 3136
number of attributes: 24

variables:

  • county_name (nominal, object, discrete, 0 missing values)
  • crime_rate_per_100000 (ratio, float64, continuous, 0 missing values)
  • ROBBERY (ratio, int64, discrete, 0 missing values)
  • MURDER (ratio, int64, discrete, 0 missing values)
  • population (ratio, int64, discrete, 0 missing values)

Question to explore:
Is there a relationship between a county's population and the frequency of certain crimes, such as rape and theft?

https://www.kaggle.com/datasets/muonneutrino/us-census-demographic-data
This dataset contains data per county in the United States. It has columns such as ethnic background (Asian, Hispanic), unemployment, people working from home, total population, and income per capita. It mainly shows socioeconomic and demographic figures.

number of instances: 3142
number of attributes: 37

variables:

  • County (nominal, object, discrete, 0 missing values)
  • Employed (ratio, int64, discrete, 0 missing values)
  • Men (ratio, int64, discrete, 0 missing values)
  • Hispanic (ratio, float64, continuous, 0 missing values)
  • TotalPop (ratio, int64, discrete, 0 missing values)

Question to explore:
How does the employment situation differ between ethnic groups, such as Hispanic, Asian, Black, etc., across counties in the United States?


https://www.openintro.org/data/?data=county_complete
https://www.kaggle.com/code/stefancomanita/american-statistics-visualized-on-maps-w-plotly

● Relevant (based on what you were taught in class) descriptive statistics for the above chosen 5 variables. Exclude missing values when calculating descriptive statistics. You do not have to report Kurtosis and Skewness.
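
A minimal sketch of how such descriptive statistics could be computed with pandas. It assumes a small made-up frame in place of the real data, so the numbers are illustrative only; the column names are borrowed from the crime dataset. pandas skips missing values (NaN) by default in these reductions, which matches the requirement to exclude them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the numbers are made up for illustration.
toy = pd.DataFrame({
    "population": [100, 250, np.nan, 400, 250],
    "ROBBERY": [1, 4, 2, np.nan, 3],
})

# describe() reports count, mean, std, min, quartiles (50% is the median)
# and max, skipping NaN values in every column.
summary = toy.describe()

# Range and mode are not part of describe(), so compute them separately.
summary.loc["range"] = toy.max() - toy.min()
modes = toy.mode().iloc[0]  # first (smallest) mode per column

print(summary)
print(modes)
```

On the real data the same calls could be applied to a selection such as `crime[["population", "ROBBERY", "MURDER"]]`.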

In [6]:
zip_file_path = r'/Users/jetzeeveleens/Downloads/crime.zip'

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    file_list = zip_ref.namelist()
    print(file_list)
    csv_file_name = 'crime_data_w_population_and_crime_rate.csv'
    zip_ref.extract(csv_file_name, '/tmp')

csv_file_path = '/tmp/' + csv_file_name
crime = pd.read_csv(csv_file_path)

display(crime)
['crime_data_w_population_and_crime_rate.csv']
county_name crime_rate_per_100000 index EDITION PART IDNO CPOPARST CPOPCRIM AG_ARRST AG_OFF ... RAPE ROBBERY AGASSLT BURGLRY LARCENY MVTHEFT ARSON population FIPS_ST FIPS_CTY
0 St. Louis city, MO 1791.995377 1 1 4 1612 318667 318667 15 15 ... 200 1778 3609 4995 13791 3543 464 318416 29 510
1 Crittenden County, AR 1754.914968 2 1 4 130 50717 50717 4 4 ... 38 165 662 1482 1753 189 28 49746 5 35
2 Alexander County, IL 1664.700485 3 1 4 604 8040 8040 2 2 ... 2 5 119 82 184 12 2 7629 17 3
3 Kenedy County, TX 1456.310680 4 1 4 2681 444 444 1 1 ... 3 1 2 5 4 4 0 412 48 261
4 De Soto Parish, LA 1447.402430 5 1 4 1137 26971 26971 3 3 ... 4 17 368 149 494 60 0 27083 22 31
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3131 Ohio County, IN 0.000000 3132 1 4 762 6084 0 2 1 ... 0 0 0 2 2 0 0 5994 18 115
3132 Newton County, MS 0.000000 3133 1 4 1465 21545 3346 3 1 ... 0 0 0 4 0 1 0 21689 28 101
3133 Jerauld County, SD 0.000000 3134 1 4 2424 2108 2108 1 1 ... 0 0 0 1 3 1 0 2066 46 73
3134 Cimarron County, OK 0.000000 3135 1 4 2167 2502 2502 2 2 ... 0 0 0 1 2 0 0 2335 40 25
3135 Lawrence County, MS 0.000000 3136 1 4 1453 12714 0 1 1 ... 0 0 0 0 0 0 0 12514 28 77

3136 rows × 24 columns

Second dataset

In [7]:
zip_file_path = r'/Users/jetzeeveleens/Downloads/census.zip'
csv_file_name = 'acs2017_county_data.csv'

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extract(csv_file_name, '/tmp')

csv_file_path = '/tmp/' + csv_file_name
census = pd.read_csv(csv_file_path)

census = census.drop(census[census["State"] == "Puerto Rico"].index)
display(census)
CountyId State County TotalPop Men Women Hispanic White Black Native ... Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
0 1001 Alabama Autauga County 55036 26899 28137 2.7 75.4 18.9 0.3 ... 0.6 1.3 2.5 25.8 24112 74.1 20.2 5.6 0.1 5.2
1 1003 Alabama Baldwin County 203360 99527 103833 4.4 83.1 9.5 0.8 ... 0.8 1.1 5.6 27.0 89527 80.7 12.9 6.3 0.1 5.5
2 1005 Alabama Barbour County 26201 13976 12225 4.2 45.7 47.8 0.2 ... 2.2 1.7 1.3 23.4 8878 74.1 19.1 6.5 0.3 12.4
3 1007 Alabama Bibb County 22580 12251 10329 2.4 74.6 22.0 0.4 ... 0.3 1.7 1.5 30.0 8171 76.0 17.4 6.3 0.3 8.2
4 1009 Alabama Blount County 57667 28490 29177 9.0 87.4 1.5 0.3 ... 0.4 0.4 2.1 35.0 21380 83.9 11.9 4.0 0.1 4.9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3137 56037 Wyoming Sweetwater County 44527 22981 21546 16.0 79.6 0.8 0.6 ... 2.8 1.3 1.5 20.5 22739 78.4 17.8 3.8 0.0 5.2
3138 56039 Wyoming Teton County 22923 12169 10754 15.0 81.5 0.5 0.3 ... 11.7 3.8 5.7 14.3 14492 82.1 11.4 6.5 0.0 1.3
3139 56041 Wyoming Uinta County 20758 10593 10165 9.1 87.7 0.1 0.9 ... 1.1 1.3 2.0 19.9 9528 71.5 21.5 6.6 0.4 6.4
3140 56043 Wyoming Washakie County 8253 4118 4135 14.2 82.2 0.3 0.4 ... 6.9 1.3 4.4 14.3 3833 69.8 22.0 8.1 0.2 6.1
3141 56045 Wyoming Weston County 7117 3756 3361 1.4 91.6 0.5 0.1 ... 3.0 1.6 6.9 25.7 3407 68.2 21.9 8.8 1.1 2.2

3142 rows × 37 columns

In [8]:
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
In [9]:
crime_county_names = crime['county_name'].unique()
print(len(crime_county_names))

census_county_names = census['County'].unique()
print(len(census_county_names))

# There are 3,142 counties in the United States
3136
1877
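
The gap between the two counts is probably explained by the name format: `county_name` in the crime data carries a state abbreviation (for example `St. Louis city, MO`), while `County` in the census data does not, so counties that share a name across states collapse into one. A minimal sketch on made-up rows, assuming the census column names `State` and `County`:

```python
import pandas as pd

# Toy stand-in: the same county name occurs in more than one state.
toy = pd.DataFrame({
    "State": ["Alabama", "Georgia", "Alabama", "Texas"],
    "County": ["Lee County", "Lee County", "Bibb County", "Lee County"],
})

# Counting names alone merges counties that merely share a name.
unique_names = toy["County"].nunique()

# Counting (State, County) pairs keeps them apart.
unique_pairs = len(toy.drop_duplicates(subset=["State", "County"]))

print(unique_names, unique_pairs)
```

This is one reason to join the two tables on a FIPS code rather than on the county name.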
In [10]:
def convertToFipsForCensus(row):
    countyId = row["CountyId"]

    if countyId >= 10000:
        return str(countyId)

    return "0" + str(countyId)

census["fips"] = census.apply(lambda row: convertToFipsForCensus(row), axis = 1)
census.head()
Out[10]:
CountyId State County TotalPop Men Women Hispanic White Black Native ... OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment fips
0 1001 Alabama Autauga County 55036 26899 28137 2.7 75.4 18.9 0.3 ... 1.3 2.5 25.8 24112 74.1 20.2 5.6 0.1 5.2 01001
1 1003 Alabama Baldwin County 203360 99527 103833 4.4 83.1 9.5 0.8 ... 1.1 5.6 27.0 89527 80.7 12.9 6.3 0.1 5.5 01003
2 1005 Alabama Barbour County 26201 13976 12225 4.2 45.7 47.8 0.2 ... 1.7 1.3 23.4 8878 74.1 19.1 6.5 0.3 12.4 01005
3 1007 Alabama Bibb County 22580 12251 10329 2.4 74.6 22.0 0.4 ... 1.7 1.5 30.0 8171 76.0 17.4 6.3 0.3 8.2 01007
4 1009 Alabama Blount County 57667 28490 29177 9.0 87.4 1.5 0.3 ... 0.4 2.1 35.0 21380 83.9 11.9 4.0 0.1 4.9 01009

5 rows × 38 columns

In [11]:
def createFipsForCrime(row):
    cityFips = str(row["FIPS_CTY"])
    stateFips = str(row["FIPS_ST"])

    if len(cityFips) == 1:
        cityFips = "00" + cityFips

    if len(cityFips) == 2:
        cityFips = "0" + cityFips

    if len(stateFips) == 1:
        stateFips = "0" + stateFips

    return stateFips + cityFips

crime["fips"] = crime.apply(lambda row: createFipsForCrime(row), axis = 1)
crime.head()
Out[11]:
county_name crime_rate_per_100000 index EDITION PART IDNO CPOPARST CPOPCRIM AG_ARRST AG_OFF ... ROBBERY AGASSLT BURGLRY LARCENY MVTHEFT ARSON population FIPS_ST FIPS_CTY fips
0 St. Louis city, MO 1791.995377 1 1 4 1612 318667 318667 15 15 ... 1778 3609 4995 13791 3543 464 318416 29 510 29510
1 Crittenden County, AR 1754.914968 2 1 4 130 50717 50717 4 4 ... 165 662 1482 1753 189 28 49746 5 35 05035
2 Alexander County, IL 1664.700485 3 1 4 604 8040 8040 2 2 ... 5 119 82 184 12 2 7629 17 3 17003
3 Kenedy County, TX 1456.310680 4 1 4 2681 444 444 1 1 ... 1 2 5 4 4 0 412 48 261 48261
4 De Soto Parish, LA 1447.402430 5 1 4 1137 26971 26971 3 3 ... 17 368 149 494 60 0 27083 22 31 22031

5 rows × 25 columns
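
The two helper functions above pad the codes row by row. As a possible alternative, the same five-character FIPS string can be built column-wise with `str.zfill`. This is a sketch on a toy frame that reuses the first three `FIPS_ST`/`FIPS_CTY` pairs shown above, not a replacement the notebook depends on.

```python
import pandas as pd

# Toy stand-in with the same column names as the crime data.
toy = pd.DataFrame({"FIPS_ST": [29, 5, 17], "FIPS_CTY": [510, 35, 3]})

# Left-pad the state code to 2 digits and the county code to 3 digits,
# then concatenate them into a 5-character FIPS string.
toy["fips"] = (
    toy["FIPS_ST"].astype(str).str.zfill(2)
    + toy["FIPS_CTY"].astype(str).str.zfill(3)
)

print(toy["fips"].tolist())
```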

In [12]:
crime_census = census.merge(crime, how="left", on="fips")

display(crime_census.head())
display(crime_census.shape)
print('----------------------------------------')
display(crime_census.isnull().sum())
print('--------------------------------------------------------------------------------')
print(crime_census.columns)
CountyId State County TotalPop Men Women Hispanic White Black Native ... RAPE ROBBERY AGASSLT BURGLRY LARCENY MVTHEFT ARSON population FIPS_ST FIPS_CTY
0 1001 Alabama Autauga County 55036 26899 28137 2.7 75.4 18.9 0.3 ... 15.0 34.0 87.0 447.0 1233.0 85.0 108.0 55246.0 1.0 1.0
1 1003 Alabama Baldwin County 203360 99527 103833 4.4 83.1 9.5 0.8 ... 30.0 76.0 332.0 967.0 3829.0 192.0 31.0 195540.0 1.0 3.0
2 1005 Alabama Barbour County 26201 13976 12225 4.2 45.7 47.8 0.2 ... 4.0 8.0 36.0 90.0 362.0 21.0 0.0 27076.0 1.0 5.0
3 1007 Alabama Bibb County 22580 12251 10329 2.4 74.6 22.0 0.4 ... 4.0 8.0 36.0 122.0 251.0 27.0 0.0 22512.0 1.0 7.0
4 1009 Alabama Blount County 57667 28490 29177 9.0 87.4 1.5 0.3 ... 11.0 9.0 101.0 397.0 865.0 86.0 9.0 57872.0 1.0 9.0

5 rows × 62 columns

(3142, 62)
----------------------------------------
CountyId      0
State         0
County        0
TotalPop      0
Men           0
             ..
MVTHEFT       9
ARSON         9
population    9
FIPS_ST       9
FIPS_CTY      9
Length: 62, dtype: int64
--------------------------------------------------------------------------------
Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
       'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment', 'fips', 'county_name',
       'crime_rate_per_100000', 'index', 'EDITION', 'PART', 'IDNO', 'CPOPARST',
       'CPOPCRIM', 'AG_ARRST', 'AG_OFF', 'COVIND', 'INDEX', 'MODINDX',
       'MURDER', 'RAPE', 'ROBBERY', 'AGASSLT', 'BURGLRY', 'LARCENY', 'MVTHEFT',
       'ARSON', 'population', 'FIPS_ST', 'FIPS_CTY'],
      dtype='object')
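
The null counts show 9 census counties without a matching crime row. A minimal sketch of how such unmatched rows could be listed, using the `indicator` option of `DataFrame.merge` on made-up frames (the `fips` values below are illustrative):

```python
import pandas as pd

# Toy stand-ins; the "fips" values are made up for illustration.
left = pd.DataFrame({"fips": ["01001", "01003", "02013"], "TotalPop": [10, 20, 30]})
right = pd.DataFrame({"fips": ["01001", "01003"], "MURDER": [1, 0]})

# indicator=True adds a "_merge" column saying where each row came from.
merged = left.merge(right, how="left", on="fips", indicator=True)

# Rows present only in the left table are the ones that got NaN crime columns.
unmatched = merged.loc[merged["_merge"] == "left_only", "fips"].tolist()

print(unmatched)
```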
In [13]:
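def showMap(df: pd.DataFrame, counties, target: str, colorscheme=px.colors.diverging.PRGn, min=5, max=50):
    """Draw a US county choropleth of df[target], matching rows to the GeoJSON via 'fips'."""
    # Note: min and max shadow the built-ins inside this function; the names
    # are kept because the calls below pass them as keywords.
    fig = px.choropleth(df, geojson=counties, locations='fips', color=target,
                        range_color=(min, max),
                        scope="usa",
                        labels={target: target},
                        # Midpoint sits one third of the way into the colour range.
                        color_continuous_midpoint=((max - min) / 3) + min,
                        color_continuous_scale=colorscheme
                        )
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})

    return fig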
In [14]:
minCrimeRate = crime_census["crime_rate_per_100000"].min()
maxCrimeRate = crime_census["crime_rate_per_100000"].max()
print(minCrimeRate, maxCrimeRate)
0.0 1791.995377
In [15]:
crime_census['black_hispanic'] = crime_census['Black'] + crime_census['Hispanic']
In [16]:
crimes_per = showMap(crime_census, counties, "crime_rate_per_100000", min=0, max=400, colorscheme='Reds')
crimes_per.show()
In [17]:
voting_age = showMap(crime_census, counties, "VotingAgeCitizen", min=0, max=70000, colorscheme='emrld')
voting_age.show()

print(crime_census['VotingAgeCitizen'].max(), crime_census['VotingAgeCitizen'].min())
6218279 59
In [18]:
total_pop = showMap(crime_census, counties, "TotalPop", min=0, max=75000, colorscheme='Greens')
total_pop.show()

print(crime_census['TotalPop'].max(), crime_census['TotalPop'].min())
10105722 74
In [19]:
unemployment = showMap(crime_census, counties, "Unemployment", min=0, max=10, colorscheme='greys')
unemployment.show()

print(crime_census['Unemployment'].max(), crime_census['Unemployment'].min())
28.8 0.0
In [20]:
poverty = showMap(crime_census, counties, "Poverty", min=0, max=25, colorscheme='sunset')
poverty.show()

print(crime_census['Poverty'].max(), crime_census['Poverty'].min())
52.0 2.4
In [21]:
income_cap = showMap(crime_census, counties, "IncomePerCap", min=9000, max=25000, colorscheme='mint')
income_cap.show()

print(crime_census['IncomePerCap'].max(), crime_census['IncomePerCap'].min())
69529 9334
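
The maps above use hand-picked `min`/`max` values that sit well below the column maxima (for example 75000 against a maximum `TotalPop` of 10105722). One possible way to derive such a clipping range from the data is to use percentiles. The sketch below uses made-up values, and the 5th/95th cut-offs are an arbitrary assumption rather than what the notebook does.

```python
import numpy as np

# Toy, right-skewed values standing in for a column such as TotalPop.
values = np.array([74, 500, 1_000, 5_000, 20_000, 75_000, 10_105_722], dtype=float)

# Clip the colour range to the 5th-95th percentile so that one very large
# county does not wash out the rest of the map. NaN values are ignored.
low, high = np.nanpercentile(values, [5, 95])

print(low, high)
```

The resulting pair could then be passed to `showMap` as `min=low, max=high`.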
In [22]:
# Drop rows with missing values in the 'crime_rate_per_100000' or 'ChildPoverty' columns
crime_census_cleaned = crime_census.dropna(subset=['crime_rate_per_100000', 'ChildPoverty'])
In [25]:
from sklearn.linear_model import LinearRegression 
plt.figure(figsize=(10, 6))
sns.scatterplot(data=crime_census_cleaned, x='ChildPoverty', y='crime_rate_per_100000', marker='o', color='b')

plt.title('Relationship between child poverty and crime rate per 100,000')
plt.xlabel('Child poverty')
plt.ylabel('Crime rate per 100,000')

x = crime_census_cleaned[['ChildPoverty']]
y = crime_census_cleaned['crime_rate_per_100000']

reg = LinearRegression()
reg.fit(x, y)
predictions = reg.predict(x)

plt.plot(x, predictions, color='r')

plt.show()
[Figure: scatter plot of ChildPoverty (x-axis) against crime_rate_per_100000 (y-axis), with the fitted regression line in red]
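
The plot shows the fitted line but not how well it fits. Below is a minimal sketch of how slope, intercept and R² could be reported. It uses `np.polyfit` on synthetic data as a stand-in for the `LinearRegression` fit above, so none of the numbers relate to the county data.

```python
import numpy as np

# Synthetic data for illustration: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=200)
y = 3.0 * x + 100.0 + rng.normal(0, 20, size=200)

# Degree-1 least-squares fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)

# R^2 = 1 - SS_res / SS_tot, the share of variance explained by the line.
predictions = slope * x + intercept
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

With the scikit-learn model from the cell above, the same quantities are available as `reg.coef_`, `reg.intercept_` and `reg.score(x, y)`.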

'aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance', 'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg', 'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl', 'darkmint', 'deep', 'delta', 'dense', 'earth', 'edge', 'electric', 'emrld', 'fall', 'geyser', 'gnbu', 'gray', 'greens', 'greys', 'haline', 'hot', 'hsv', 'ice', 'icefire', 'inferno', 'jet', 'magenta', 'magma', 'matter', 'mint', 'mrybm', 'mygbm', 'oranges', 'orrd', 'oryel', 'oxy', 'peach', 'phase', 'picnic', 'pinkyl', 'piyg', 'plasma', 'plotly3', 'portland', 'prgn', 'pubu', 'pubugn', 'puor', 'purd', 'purp', 'purples', 'purpor', 'rainbow', 'rdbu', 'rdgy', 'rdpu', 'rdylbu', 'rdylgn', 'redor', 'reds', 'solar', 'spectral', 'speed', 'sunset', 'sunsetdark', 'teal', 'tealgrn', 'tealrose', 'tempo', 'temps', 'thermal', 'tropic', 'turbid', 'turbo', 'twilight', 'viridis', 'ylgn', 'ylgnbu', 'ylorbr', 'ylorrd'

In [27]:
correlation = crime_census['Income'].corr(crime_census['crime_rate_per_100000'], method='pearson')
print("Correlation between income and crime rate:", round(correlation,2))
Correlation between income and crime rate: -0.14
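
A minimal sketch of what `Series.corr(..., method='pearson')` computes, on made-up numbers. pandas drops every pair in which either value is missing, which is relevant here because the merged frame has counties without crime data.

```python
import numpy as np
import pandas as pd

# Made-up values; one pair has a missing crime rate.
income = pd.Series([30_000, 45_000, 52_000, 61_000, 80_000], dtype=float)
crime_rate = pd.Series([400.0, 310.0, np.nan, 250.0, 120.0])

# pandas computes Pearson's r on the pairs where both values are present.
r_pandas = income.corr(crime_rate, method="pearson")

# The same value by hand: the sum of products of deviations divided by the
# square root of the product of the summed squared deviations.
mask = income.notna() & crime_rate.notna()
xc = income[mask] - income[mask].mean()
yc = crime_rate[mask] - crime_rate[mask].mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

print(round(r_pandas, 4), round(r_manual, 4))
```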

Part B

In [28]:
# Download the COVID-19 dataset
covid_url = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
covid_df = pd.read_csv(covid_url)

# Convert the date column to datetime format
covid_df['date'] = pd.to_datetime(covid_df['date'])

# Filter data for the year 2023
covid_df_2023 = covid_df[covid_df['date'].dt.year == 2023]

# Sort by date to ensure the latest data is selected for each country
covid_df_sorted = covid_df_2023.sort_values('date')

# Group by country and get the last entry for each country
covid_df_aggregated = covid_df_sorted.groupby('location').last().reset_index()

# Select relevant columns
covid_selected_columns = ['location', 'total_cases', 'total_deaths', 'total_vaccinations', 'population']
covid_df_aggregated = covid_df_aggregated[covid_selected_columns]

# Rename columns for clarity
covid_df_aggregated.rename(columns={'location': 'country'}, inplace=True)

# Display the aggregated DataFrame
display(covid_df_aggregated)
country total_cases total_deaths total_vaccinations population
0 Afghanistan 230375.0 7973.0 2.296475e+07 4.112877e+07
1 Africa 13133432.0 259066.0 8.632379e+08 1.426737e+09
2 Albania 334596.0 3604.0 3.088966e+06 2.842318e+06
3 Algeria 272010.0 6881.0 NaN 4.490323e+07
4 American Samoa 8359.0 34.0 NaN 4.429500e+04
... ... ... ... ... ...
248 Wallis and Futuna 3550.0 8.0 1.805800e+04 1.159600e+04
249 World 773948532.0 7015947.0 1.357576e+10 7.975105e+09
250 Yemen 11945.0 2159.0 1.298654e+06 3.369661e+07
251 Zambia 349304.0 4069.0 1.345421e+07 2.001767e+07
252 Zimbabwe 266071.0 5731.0 NaN 1.632054e+07

253 rows × 5 columns
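
One detail worth checking in the aggregation above: `groupby(...).last()` returns the last non-missing value of each column, not the last row, so the values in one output row can come from different dates. A minimal sketch on made-up data that contrasts it with `tail(1)`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the COVID data; the numbers are made up.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-01-01", "2023-02-01"]),
    "total_cases": [10.0, 12.0, 5.0, 7.0],
    "total_vaccinations": [100.0, np.nan, 50.0, 60.0],
})
toy = toy.sort_values("date")

# last() skips NaN per column: for "A" it pairs the February case count
# with the January vaccination count.
last_non_null = toy.groupby("location").last()

# tail(1) keeps the literal last row of each group, NaN included.
last_row = toy.groupby("location").tail(1).set_index("location")

print(last_non_null)
print(last_row)
```

Which behaviour is preferable depends on whether a slightly older vaccination figure is better than a missing one.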

https://www.kaggle.com/code/hasibalmuzdadid/world-population-analysis/input

In [29]:
df2 = pd.read_csv('world_population.csv')
display(df2.head(), df2.isnull().sum())
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[29], line 1
----> 1 df2 = pd.read_csv('world_population.csv')
      2 display(df2.head(), df2.isnull().sum())

File /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.12/site-packages/pandas/io/common.py:873, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    868 elif isinstance(handle, str):
    869     # Check whether the filename is to be opened in binary mode.
    870     # Binary mode does not support 'encoding' and 'newline'.
    871     if ioargs.encoding and "b" not in ioargs.mode:
    872         # Encoding
--> 873         handle = open(
    874             handle,
    875             ioargs.mode,
    876             encoding=ioargs.encoding,
    877             errors=errors,
    878             newline="",
    879         )
    880     else:
    881         # Binary mode
    882         handle = open(handle, ioargs.mode)

FileNotFoundError: [Errno 2] No such file or directory: 'world_population.csv'
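
The traceback above is raised because the relative path `world_population.csv` is resolved against the notebook's working directory, where the file is not present. A minimal sketch of a guarded read that reports the expected location instead of failing deep inside pandas; the variable names `csv_path` and `world_pop` are hypothetical, and the path has to be adjusted to wherever the Kaggle CSV was saved.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location; adjust to wherever the Kaggle CSV was saved.
csv_path = Path("world_population.csv")

if csv_path.exists():
    world_pop = pd.read_csv(csv_path)
    print(world_pop.head())
else:
    # Report the absolute path that was tried, so the fix is obvious.
    world_pop = None
    print(f"File not found: {csv_path.resolve()}")
```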
In [ ]:
df3 = covid_df_aggregated.merge(df2, left_on='country', right_on='Country/Territory', how='inner')
display(df3.head())
In [ ]:
correlation = df3['total_cases'].corr(df3['Density (per km²)'], method='pearson')
print("Pearson's correlation coefficient:", round(correlation,2))
In [ ]:
correlation = df3['total_cases'].corr(df3['Area (km²)'], method='pearson')
print("Pearson's correlation coefficient:", round(correlation,2))
In [ ]:
correlation = df3['total_cases'].corr(df3['population'], method='pearson')
print("Pearson's correlation coefficient:", round(correlation,2))
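
The three correlation cells above repeat the same call with a different column. As a possible shortcut, `DataFrame.corrwith` can compute all of them at once. The sketch below uses a made-up frame whose column names follow the cells above; the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the merged frame; column names follow the cells above.
toy = pd.DataFrame({
    "total_cases": [100.0, 250.0, 400.0, 900.0],
    "Density (per km²)": [50.0, 20.0, 80.0, 10.0],
    "Area (km²)": [1_000.0, 5_000.0, 2_000.0, 9_000.0],
    "population": [1e6, 3e6, 4e6, 1e7],
})

# Correlate every other column with total_cases in a single call.
predictors = toy.drop(columns="total_cases")
correlations = predictors.corrwith(toy["total_cases"], method="pearson").round(2)

print(correlations)
```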